Search CORE

13 research outputs found

Linear-time String Indexing and Analysis in Small Space

Author: Belazzougui Djamal
Cunial Fabio
Karkkainen Juha
Makinen Veli
Publication date: 01/03/2020

The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space also lies at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string T ∈ {1, ..., σ}^n can be built in deterministic O(n) time using just O(n log σ) bits of space, where σ [...]. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O(n) time and in O(n log σ) bits of space. The previously fastest construction algorithms for the BWT, the compressed suffix array, and the compressed suffix tree, which used O(n log σ) bits of space, took O(n log log σ) time for the first two structures and O(n log^ε n) time for the third, where ε is any positive constant smaller than one. Alternatively, the BWT could previously be built in linear time if one was willing to spend O(n log σ log log_σ n) bits of space. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output. Peer reviewed
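To make the objects named in this abstract concrete, the following is a minimal sketch of the BWT and of FM-index-style backward search. It is not the linear-time, small-space construction the abstract reports: it sorts all rotations naively and computes rank by scanning, and the function names and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before every
    other character (true for "$" against letters in ASCII order).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern using backward search on the BWT.

    Rank queries are answered by scanning a prefix of the BWT, so this
    sketch is far slower than a real FM-index.
    """
    # counts[c] = number of occurrences of c in the text
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first_row[c] = number of characters strictly smaller than c
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt_str)  # half-open range of sorted rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + bwt_str[:lo].count(ch)
        hi = first_row[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(count_occurrences(transformed, "ana"))
```

The pattern is processed right to left, and each step narrows a range of sorted rotations; this is the basic query that BWT-based indexes support.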

Helsingin yliopiston digitaalinen arkisto

Patterns of genetic variation in leading-edge populations of Quercus robur: genetic patchiness due to family clusters

Author: Heikkinen Juha
Huotari Tea
Karkkainen Katri
Rusanen Mari
Vakkari Pekka
Publication date: 01/01/2020

The genetic structure of populations at the edge of a species' distribution is important for species adaptation to environmental changes. Small populations may experience non-random mating and differentiation due to genetic drift, but larger populations, too, may have a low effective size, e.g., due to within-population structure. We studied the spatial population structure of pedunculate oak, Quercus robur, at the northern edge of the species' global distribution, where oak populations are experiencing rapid climatic and anthropogenic changes. Using 12 microsatellite markers, we analyzed genetic differentiation of seven small to medium-sized populations (census sizes 57-305 reproducing trees) and analyzed four populations for within-population genetic structures. Genetic differentiation among the seven populations was low (Fst = 0.07). We found a strong spatial genetic structure in each of the four populations. Spatial autocorrelation was significant in all populations and its intensity (Sp) was higher than that reported in more southern oak populations. Significant genetic patchiness was revealed by Bayesian structuring, and a high number of spatially aggregated full and half sibs was detected by sibship reconstruction. Meta-analysis of isoenzyme and SSR data extracted from the GD² database suggested a northward-decreasing trend in the expected heterozygosity and the effective number of alleles, thus supporting the central-marginal hypothesis in oak populations. We suggest that the fragmented distribution and the location of Finnish pedunculate oak populations at the species' northern margin facilitate the formation of within-population genetic structures. Information on the existence of spatial genetic structures can help conservation managers to design gene conservation activities and to avoid overly strong family structures in the sampling of seeds and cuttings for afforestation and tree improvement purposes. Peer reviewed

Jukuri

Helsingin yliopiston digitaalinen arkisto

On Suffix Tree Breadth

Author: Badkobeh Golnaz
Karkkainen Juha
Puglisi Simon
Zhukova Bella
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/09/2017

The suffix tree (the compacted trie of all the suffixes of a string) is the most important and widely used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes ν_S(d) can there be at (string) depth d in its suffix tree? We prove that ν(n, d) = max_{S ∈ Σ^n} ν_S(d) is O((n/d) log n), and show that this bound is almost tight, describing strings for which ν_S(d) is Ω((n/d) log(n/d)).
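The quantity ν_S(d) can be computed by brute force for small strings, which may help when experimenting with the bound. The sketch below assumes the suffix tree is built over S followed by a unique sentinel `$` (the abstract does not say whether a terminator is used), and it counts a substring of length d as a node if it is followed by at least two distinct characters (an internal node) or if it is a suffix (a leaf). The function name is hypothetical.

```python
from collections import defaultdict


def nodes_at_depth(s: str, d: int, sentinel: str = "$") -> int:
    """Count suffix tree nodes at string depth d, by brute force.

    Assumption: the tree is built over s + sentinel, where the sentinel
    occurs nowhere in s. Runs in O(n * d) time, for illustration only.
    """
    assert sentinel not in s
    t = s + sentinel
    followers = defaultdict(set)
    for i in range(len(t) - d + 1):
        w = t[i:i + d]
        # None marks that this occurrence of w ends the text, i.e. w is a suffix.
        nxt = t[i + d] if i + d < len(t) else None
        followers[w].add(nxt)
    # A node exists for w if w is a suffix (leaf) or is right-branching (internal).
    return sum(1 for f in followers.values() if None in f or len(f) >= 2)


if __name__ == "__main__":
    for depth in range(0, 8):
        print(depth, nodes_at_depth("banana", depth))
```

For "banana", depth 1 has two nodes: the internal node "a" and the leaf "$".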

Goldsmiths Research Online

Crossref

Storage Efficient Substring Searchable Symmetric Encryption

Author: Abouelhoda Mohamed Ibrahim
Cash David
Chase Melissa
Faber Sky
Gentry Craig
Karkkainen Juha
Lindell Yehuda
Manber Udi
Publication date: 28/06/2017

We address the problem of substring searchable encryption. A single user produces a large stream of data and later wants to learn the positions in the string at which some patterns occur. Although current techniques exploit auxiliary data structures to achieve efficient substring search on the server side, the cost on the user side may be prohibitive. We revisit the work on substring searchable encryption in order to reduce the storage cost of the auxiliary data structures. Our solution entails a suffix-array-based index design, which allows an optimal storage cost of O(n), with a small hidden constant factor, in the size n of the string. We analyze the security of the protocol in the real/ideal framework. Moreover, we implemented our scheme and the state-of-the-art protocol [7] to demonstrate the performance advantage of our solution with precise benchmark results.
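As background for the suffix-array-based index mentioned in this abstract, here is a minimal plaintext sketch of substring search over a suffix array. It contains no encryption and is not the protocol from the paper; it only shows why an index of n integers is enough to report all positions of a pattern. The function names are assumptions, and the construction is the naive sort-based one.

```python
def build_suffix_array(text: str) -> list:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting suffixes; fine for short strings only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list, pattern: str) -> list:
    """Return all start positions of pattern in text, in increasing order.

    Two binary searches locate the block of suffixes that begin with pattern.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    data = "abracadabra"
    sa = build_suffix_array(data)
    print(find_occurrences(data, sa, "abra"))
```

The index stores one integer per text position, which is the O(n) storage the abstract refers to; how that index is protected and queried under encryption is the subject of the paper and is not modeled here.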

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Cryptology ePrint Archive

Late gadolinium enhanced cardiovascular magnetic resonance of lamin A/C gene mutation related dilated cardiomyopathy

Author: Antila Margareta
Helio Tiina
Holmstrom Miia
Jurkko Raija
Kaartinen Maija
Karkkainen Satu
Kivisto Sari
Koikkalainen Juha
Kuusisto Johanna
Lauerma Kirsi-Maria Susanna
Lotjonen Jyrki
Peuhkurinen Keijo
Reissell Eeva
Publication date: 01/01/2011

Peer reviewed

Springer - Publisher Connector

PubMed Central

VTT Research System

Helsingin yliopiston digitaalinen arkisto

Serum Lipidomics Meets Cardiac Magnetic Resonance Imaging: Profiling of Subjects at Risk of Dilated Cardiomyopathy

Author: Antila Margareta
Heliö Tiina
Jurkko Raija
Kaartinen Maija
Karkkainen Satu
Koikkalainen Juha
Kuusisto Johanna
Lauerma Kirsi-Maria Susanna
Lotjonen Jyrki
Oresic Matej
Peuhkurinen Keijo
Reissell Eeva
Seppanen-Laakso Tuulikki
Sysi-Aho Marko
Publication date: 01/01/2011

Peer reviewed

Directory of Open Access Journals

PubMed Central

VTT Research System

Helsingin yliopiston digitaalinen arkisto

Evaluating housing quality, health and safety using an Internet-based data collection and response system: a cross-sectional study

Author: A Kullberg
A Lindfors
A Weltner
Aino Nevalainen
Ari Paanala
B Smith
C Maschke
C Shim
CP Weisel
IJ Selikoff
J Brogger
JP Zock
Juha Villman
K Wilson
L Hagerhed-Engman
LK Baxter
M Frisk
M Kilpelainen
M Taubel
M Toivola
Mari Turunen
NE Rosenthal
O Garcia-Algar
O Koskinen
P Kaarakainen
P Ritter
P Suominen
PM Karkkainen
S Darby
Strandell Anna
U Haverinen
Ulla Haverinen-Shaughnessy
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Typically, housing and health surveys are not integrated and therefore are not representative of population health or national housing stocks. In addition, the existing channels for distributing information about housing and health issues to the general public are limited. The aim of this study was to develop a data collection and response system that would allow us to assess the Finnish housing stock from the points of view of quality, health and safety, and also to provide a tool to distribute information about important housing health and safety issues. Methods: The data collection and response system was tested with a sample of 3000 adults (one per household), who were randomly selected from the Finnish Population Register Centre. Spatial information about the exact location of the residences (i.e. coordinates) was included in the database inquiry. People could participate either by completing and returning a paper questionnaire or by completing the same questionnaire via the Internet. The respondents did not receive any compensation for their time in completing the questionnaire. Results: This article describes the data collection and response system and presents the main results of the population-based testing of the system. A total of 1312 people (response rate 44%) answered the questionnaire, though only 80 answered via the Internet. A third of the respondents indicated that they wanted feedback. Although a majority (>90%) of the respondents reported being satisfied or quite satisfied with their residence, a number of prevalent housing issues were identified that can be related to health and safety. Conclusions: The collected database can be used to evaluate the quality of the housing stock in terms of occupant health and safety, and to model its association with occupant health and well-being. However, it must be noted that all the health outcomes gathered in this study are self-reported. A follow-up study is needed to evaluate whether the occupants acted on the feedback they received. Relying solely on an Internet-based questionnaire for collecting data would not appear to provide an adequate response rate for random population-based surveys at this point in time.

Crossref

Springer - Publisher Connector

Julkari

Directory of Open Access Journals

PubMed Central

Episode Matching

Author: Dimitris Gunopulos
Gautam Das
Juha Karkkainen
Leszek Gasieniec
Rudolf Fleischer
Publication date: 01/01/1997

Given two words, text T of length n and episode P of length m, the episode matching problem is to find all minimal-length substrings of text T that contain episode P as a subsequence. The respective optimization problem is to find the smallest number w such that text T has a subword of length w which contains episode P. In this paper, we introduce a few efficient off-line as well as on-line algorithms for the entire problem, where by on-line algorithms we mean algorithms which search consecutive text symbols from left to right only once. We present two alphabet-independent algorithms which work in time O(nm). The off-line algorithm operates in O(1) additional space, while the on-line algorithm pays for its property with O(m) additional space. Two other on-line algorithms have subquadratic time complexity. One of them works in time O(nm / log m) and O(m) additional space. The other one gives a time/space trade-off, i.e., it works in time O(n + s + nm log log s / log(s/m)) when additional s...
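The problem statement can be illustrated with a simple on-line dynamic program that runs in O(nm) time and O(m) additional space. This is a standard textbook-style sketch and is not claimed to be any of the algorithms from the paper; the function name and the returned format are assumptions. For each prefix P[0..i], it keeps the latest text position at which a window containing that prefix as a subsequence can start.

```python
def episode_matches(text: str, episode: str):
    """Return (w, windows) for the episode matching problem.

    w is the smallest window length such that some substring of text of
    that length contains episode as a subsequence; windows lists every
    (start, end) pair, inclusive, that achieves it. Returns (None, [])
    if the episode never occurs, and (0, []) for an empty episode.
    """
    m = len(episode)
    if m == 0:
        return 0, []

    # last[i] = latest start of a window ending at the current position
    # that contains episode[0..i] as a subsequence, or -1 if none yet.
    last = [-1] * m
    best = None
    windows = []

    for j, ch in enumerate(text):
        # Go from the longest prefix down so last[i - 1] still refers to
        # the state before position j was read.
        for i in range(m - 1, -1, -1):
            if ch == episode[i]:
                last[i] = j if i == 0 else last[i - 1]
        if ch == episode[-1] and last[-1] >= 0:
            length = j - last[-1] + 1
            if best is None or length < best:
                best, windows = length, [(last[-1], j)]
            elif length == best:
                windows.append((last[-1], j))

    return best, windows


if __name__ == "__main__":
    print(episode_matches("axbxxab", "ab"))
```

Each text symbol is read once, left to right, which matches the abstract's notion of an on-line algorithm; the inner loop over the m prefixes gives the O(nm) time bound.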

CiteSeerX

MPG.PuRe